ABSTRACT



Introduction. Case Based Reasoning (CBR) is an emerg- 
ing decision making paradigm in medical research where new 
cases are solved relying on previously solved similar cases. 
Usually, a database of solved cases is provided, and every 
case is described through a set of attributes (inputs) and a la- 
bel (output). Extracting useful information from this database 
can help the CBR system provide more reliable results on
cases that are yet to be solved.

Objective. For that purpose we suggest a general frame- 
work where a CBR system, viz. K-Nearest Neighbor (K-NN) 
algorithm, is combined with various information obtained from 
a Logistic Regression (LR) model. 

Methods. LR is applied, on the case database, to assign 
weights to the attributes as well as the solved cases. Thus, 
five possible decision making systems based on K-NN and/or 
LR were identified: a standalone K-NN, a standalone LR and 
three soft K-NN algorithms that rely on the weights based 
on the results of the LR. The described approaches are
evaluated on the problem of access to the renal transplant
waiting list.

Results and conclusion. The results show that our suggested
approach, where the K-NN algorithm relies on both weighted
attributes and weighted cases, can efficiently deal with
irrelevant attributes, whereas the four other approaches
suffer from this kind of noisy setup. The robustness of this
approach suggests interesting perspectives for medical
problem solving tools using CBR methodology.

Keywords. Case-based Reasoning systems; logistic models; 
similarity measures; k-nearest neighbors algorithms; classi- 
fication. 
1. INTRODUCTION

1.1. Case Based Reasoning for Medical Applications 

Case-based reasoning (CBR) is a problem-solving paradigm
emerging in medical decision-making systems [1]. Instead
of relying solely on general knowledge of a problem domain,
CBR utilizes the specific knowledge of previously experienced,
concrete problem situations - also referred to as cases - to
tackle new ones [2].

More specifically, CBR methodology defines a general 
CBR cycle composed of four steps centered around a case 
database [3]. First, the decision making process needs to
identify, among the solved cases, those that seem to be the
most similar to the considered unsolved case. Second, it
solves the new case relying on the knowledge extracted from
the most similar solved cases. The third step consists in evaluating the
suggested solution for the new case. Finally, if the solution is 
found satisfactory, the decision making process usually stores 
the part of the experiment likely to be useful for future prob- 
lem solving. 

CBR has found one of its most fruitful application areas in
biology and medicine, and appears particularly suited to
designing decision making tools in the field of health
sciences [4]. Indeed, medicine is a highly data-intensive
field where it is advantageous to develop systems capable
of reasoning from pre-existing cases, for instance from
electronic health record repositories.

1.2. Problem Definition and Objectives 

This paper focuses on the first two steps of the CBR cycle,
viz. retrieving and reusing solutions from previously
experienced situations.

Knowledge in CBR systems consists of cases. Each case 
is a problem description linked to its solution. To solve
new problems, the decision making process needs to select
relevant cases by measuring the similarity of common
characteristics between the new and the previously
experienced cases [5].

In accordance with the traditional CBR view, the knowledge
database contains cases whose definition and construction
are problem-specific. Thus, there are as many case bases as
problems to be solved. Bergmann et al. overcome that problem
by introducing the concept of utility [6]. Similarity
measures are not computed directly from the problem
descriptions of new and previously experienced cases; they
are computed from the description of their utility, which
is defined specifically in accordance with the solution
needed.

Statistical analyses and regression modeling could be useful
to introduce utility description in CBR systems, by
converting medical data sources - or databases - into medical
case bases. Regression models contain a part of knowledge
which may be used to define the utility description of cases
and to perform problem-specific measures of similarity. This
paper provides such an illustration through the formal
definition and evaluation of a traditional CBR retrieval
algorithm, the K-Nearest Neighbor (K-NN) algorithm, coupled
with a logistic regression model.

The rest of the paper is organized as follows. First,
Section 2 specifies the scope of the paper. Then, Sections 3
and 4 respectively detail the decision making model and the
considered learning process. Section 5 focuses on the
implementation, evaluation and interpretation of the
suggested methodology. Finally, Section 6 discusses related
work and perspectives.

2. SCOPE OF THE STUDY 

2.1. Domain Application and Data Source 

To carry out this work, we used data from the French Renal
Epidemiology and Information Network (REIN) registry [7]
related to renal replacement therapies (RRT) for end-stage
renal disease (ESRD), and data from the Agence de la
Biomédecine, the French national agency for organ
transplantation, on registration on the kidney transplant
waiting list.

Registration on the waiting list is a medical decision based 
on medical factors in accordance with French medical guide- 
lines that do not really need automated decision-making sup- 
port. Nevertheless, those data and their domain application 
were chosen for several reasons: 

• Data come from a national registry whose data quality is
attested by the approval of the French Comité National des
Registres.

• Many studies showed that the selection criteria for the
waiting list diverge from one center to another, and that
access to the renal transplant waiting list is influenced
by both medical and non-medical factors [8].

• Recent studies showed that it is possible to predict ac- 
cess to the waiting list relying on some of these fac- 
tors flUHUl. 



• Our main objective is a methodological essay on com- 
bination of CBR retrieval algorithm with logistic re- 
gression, and not the implementation of a medical de- 
cision support. 

2.2. Study Population and Data Collection 

The study population consists of all incident ESRD patients
in Brittany who started an RRT (peritoneal dialysis or
hemodialysis) between January 1st, 2004 and December 31st,
2008. Patients who received a preemptive transplant and
patients who came back on the waiting list after a first
transplant were excluded.

Registration status on the transplant waiting list was
computed relying on the date of the first RRT as well as the
date of registration on the waiting list. Only patients recorded on
the waiting list within 12 months after inclusion on the REIN 
registry have been considered as registered patients. 

A set of descriptive factors was defined according to data
availability in the REIN database and the renal transplant
scientific literature [8, 11-14]. All factors were
dichotomized, i.e., reduced to a binary value. Three categories
of factors likely to be related to registration on the transplant
waiting list were studied:

• Social and demographic factors: sex, age and current 
occupation at the first RRT. 

• Clinical and biological factors at the first RRT: exis- 
tence of hypertension, diabetes, chronic respiratory fail- 
ure, chronic heart failure, ischemic heart disease, heart 
conduction disorder or arrhythmia, positive serology 
(HCV, HBV, HIV), liver cirrhosis, disability, past history
of malignancy and hemoglobin level (<11 g/dl or >11 g/dl).

• Factors related to medical care: ownership of the
nephrology facility where the first RRT was performed
(private or public), follow-up in an institution performing
transplantation, type of first RRT (hemodialysis or peritoneal
dialysis), urgent versus planned first dialysis session
and first catheterization.

Due to missing data (>10%), some factors potentially related
to registration on the waiting list were not considered in
either the statistical analyses or the CBR algorithms:
distance from the patient's residence to the transplantation
department, smoking status, body mass index, vascular
comorbidities and serum albumin level.

3. DECISION MAKING MODEL 

3.1. Decision Making Process and mathematical notations 

We depict, in this subsection, the overall mechanism designed
to predict patient access to the renal transplant waiting list.
Upper case notations refer to vectors (or sets of vectors, viz.,
matrices) whereas lower case notations refer to scalar real
variables^1. Curved notations denote sets of elements.

For the sake of generality, let $\pi$ refer to the decision
making process considered hereafter. Moreover, let
$\mathcal{C}_L$ refer to a set of labeled cases, viz. patients,
and let $\mathcal{C}_U$ refer to a set of new cases to analyze.
We aim at designing a decision making process that maps new
cases to previously solved (i.e., labeled) cases.

We consider two possible classes: as a matter of fact, a
patient is either registered on the renal transplant waiting
list or not. Consequently, the labels are assumed to be binary:
let $y_p \in \{0, 1\}$ denote the label assigned to patient
$p \in \mathcal{P}_L$, where $\mathcal{P}_L$ refers to the set
of patients considered in $\mathcal{C}_L$.

Each case-set, whether labeled ($\mathcal{C}_L$) or not
($\mathcal{C}_U$), consists of a set of patients,
$\mathcal{P}_L$ (or $\mathcal{P}_U$ respectively), and two
sub-sets, $\mathcal{A}$ and $V_L$ (or $V_U$ in the case of
$\mathcal{C}_U$), named respectively Attribute-set and
Value-set. On the one hand, $\mathcal{A}$ represents the set
of elements that characterize a case, such as social and
demographic data (e.g., age, gender and current occupation)
and clinical as well as biological data (e.g., existence of
hypertension, diabetes, chronic respiratory failure or chronic
heart failure, to name a few)^2. The set $\mathcal{A}$ is
common to both $\mathcal{C}_L$ and $\mathcal{C}_U$. On the
other hand, $V$ (i.e., either one of the sets $V_L$ and $V_U$)
represents a set of vectors related to the considered
attributes for every patient: let $v_{a,p}$ refer to the value
assigned to the attribute $a \in \mathcal{A}$ for the patient
$p \in \mathcal{P}$ (i.e., either one of the sets
$\mathcal{P}_L$ and $\mathcal{P}_U$). For ease of
representation, $V$ can be seen as a matrix of size^3
$|\mathcal{A}| \times |\mathcal{P}|$, where every cell contains
a value $v_{a,p}$. For every attribute $a$, a patient $p$ can
either verify the attribute $a$ or not. Consequently,
$v_{a,p}$ can only take a binary value in $\{0, 1\}$, where 1
means that the attribute is verified and 0 otherwise. Thus,
$V_p$ refers to a vector of $|\mathcal{A}|$ binary elements
that represents the condition of a patient
$p \in \mathcal{P}$ regarding the set of attributes
$\mathcal{A}$.

As previously mentioned, the set of patients $\mathcal{P}_L$
considered in $\mathcal{C}_L$ is already labeled. The labels
$y_p$ are stored in a vector $Y$.

Finally, we can see the decision making process $\pi$ as a
function that classifies the unlabeled patients in the set
$\mathcal{P}_U$ relying on their similarity with the set of
labeled patients. Let $S$ refer to the vector of labels
provided by the decision making engine, where every patient
$p \in \mathcal{P}_U$ is assigned a numerical value
$s_p \in [0, 1]$, such that for every patient
$p \in \mathcal{P}_U$:

$$s_p = \pi\left(\{v_{a,p}\}_{a \in \mathcal{A}}, \mathcal{C}_L, Y\right) \qquad (1)$$

where $s_p$ quantifies the proximity of patient $p$ to the
possible classes in $\mathcal{C}_L$. If $s_p$ is a binary value, i.e. $s_p \in \{0, 1\}$,

1 An exception is made for the scalar parameter K of the K-NN algorithm
for the sake of consistency with the literature.

2 The complete set of criteria is further detailed in Subsection 2.2.
3 The notation $|\mathcal{A}| \times |\mathcal{P}|$ represents the product of the
cardinalities of the sets $\mathcal{A}$ and $\mathcal{P}$.



the decision making policy $\pi$ is referred to as a hard
classification. Otherwise, it is usual to speak of soft
classification. We consider the latter approach in this paper.

In the context of CBR, the decision maker assigns a label 
to new cases depending on their similarity with previously 
solved cases. The assignment relies on a measure that quan- 
tifies the resemblance of the analyzed case with the set of la- 
beled cases. Such decision making approach mimics the de- 
cision making process of a physician when dealing with new 
patients for instance. To do so, the decision maker needs to 
assess the importance of the different factors as well as the 
reliability of the cases, i.e. patients, dealt with in the past. 

In this paper, the designed CBR relies on a soft K-NN
algorithm, perhaps one of the most widely used technologies
in CBR [15]. Namely, rather than assigning a label from one
of the two classes, we compute the probability of being
assigned each label. This probability is computed relying on
the K most similar patients already labeled. A simple
threshold on this probability would lead to a hard
classification process.

Designing our decision making mechanism requires esti- 
mating the distance between patients as well as qualifying the 
reliability of the labeled patients. These notions are discussed
in the next subsections.

3.2. Similarity Metric and Attributes' weights 

Ideally, similar patients should belong to the same class
(registered or not registered). Similar patients usually
have similar values for their respective attributes.
Equivalently to the notion of similarity, we can define a
distance measure that quantifies the proximity of the new
patient to the previously seen patients (i.e. the labeled
set of patients). The larger the similarity measure, the
smaller the distance.

For the sake of simplicity we define, in this paper, the
distance measure as follows. Let $p$ and $p'$ denote two
patients (labeled or unlabeled); the distance between these
patients is quantified through the measure:

$$d(p, p') = \sum_{a \in \mathcal{A}} \omega_a \left(1 - v_{a,p} \oplus v_{a,p'}\right)$$

where $\oplus$ refers to the exclusive OR (XOR) operator and
such that:

$$\sum_{a \in \mathcal{A}} \omega_a = 1$$

where $\omega_a$ denotes the weight assigned to attribute
$a \in \mathcal{A}$, and the similarity measure appears equal
to:

$$\sum_{a \in \mathcal{A}} \omega_a \, v_{a,p} \oplus v_{a,p'}$$
The weights $\{\omega_a\}_{a \in \mathcal{A}}$ are usually
not known a priori. Therefore, the decision maker needs to
acquire that information through a learning process. Thus,
relying on the labeled set of cases, the decision maker
estimates the impact of the various attributes considered.
This step is discussed in Section 4, where all required
learning steps are detailed.
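
As an illustration of the distance of this subsection, the following Python sketch computes a weighted distance between two binary attribute vectors. It assumes the usual weighted Hamming convention, in which only the attributes on which the two patients disagree contribute to the distance, so that identical patients are at distance 0 and, when the weights sum to one, similarity equals one minus distance. This convention, as well as the function names `weighted_distance` and `similarity`, are assumptions made for the example and not the authors' implementation (the study itself was carried out in R).

```python
def weighted_distance(v_p, v_q, weights):
    """Weighted Hamming distance between two binary attribute vectors.

    Assumption: an attribute contributes its weight only when the two
    patients disagree on it (XOR equal to 1).
    """
    if not (len(v_p) == len(v_q) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return sum(w * (a ^ b) for w, a, b in zip(weights, v_p, v_q))


def similarity(v_p, v_q, weights):
    """Similarity as the complement of the distance (weights sum to one)."""
    return 1.0 - weighted_distance(v_p, v_q, weights)


# Hypothetical example with three attributes.
weights = [0.5, 0.3, 0.2]
patient_a = [1, 0, 1]
patient_b = [1, 1, 0]
distance_ab = weighted_distance(patient_a, patient_b, weights)
```

With equal weights this reduces to the fraction of attributes on which the two patients differ.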

3.3. Soft K-Nearest Neighbor Algorithm

K-NN algorithms are simple classification techniques that
assign labels to new cases depending on their similarity
with a reference set of already labeled cases.

Thus, for every new patient $p \in \mathcal{P}_U$ to label,
a K-NN algorithm operates through two major steps, the
selection step and the fusion step:

Selection Step: 

• Compute the similarity of patient $p$ with every labeled
patient $p' \in \mathcal{P}_L$.

• Sort the patients $p' \in \mathcal{P}_L$ according to their
similarity measure.

• Select the K most similar patients $p'$.

Fusion Step: Compute a numerical value that quantifies
the proximity of the new case (i.e. patient $p$) to the set of
possible classes in the training set (i.e. $\mathcal{C}_L$).

Depending on this last step, a decision maker can, if needed, 
assign a label to the new case. Usually a threshold based clas- 
sifier is used for the assignment process. This latter is how- 
ever out of the scope of this paper. 

Let $\mathcal{P}^*_K$ refer to the optimal K-NN set obtained
after the selection step. More specifically,
$\mathcal{P}^*_K$ contains the K labeled patients, stored in
$\mathcal{P}_L$, that have the largest similarity measures
with respect to the currently analyzed patient
$p \in \mathcal{P}_U$. The fusion step consists in quantifying
the possible outcome of the decision making process.

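Finally, the outcome of the decision making process, $s_p$,
for a patient $p \in \mathcal{P}_U$ is defined as:

$$s_p = \sum_{p' \in \mathcal{P}^*_K} \frac{\omega_{p'} \, d(p, p')^{-1} \, y_{p'}}{\sum_{p'' \in \mathcal{P}^*_K} \omega_{p''} \, d(p, p'')^{-1}} \qquad (2)$$

where the set of patients' weights is denoted by the variables
$\{\omega_{p'}\}_{p' \in \mathcal{P}_L}$, and
$\{y_{p'}\}_{p' \in \mathcal{P}_L}$ are the labels assigned to
the labeled cases as defined in Subsection 3.1. The weights
$\{\omega_{p'}\}_{p' \in \mathcal{P}_L}$ are designed to verify:

$$\sum_{p' \in \mathcal{P}_L} \omega_{p'} = 1$$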
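
The selection and fusion steps can be sketched in Python as follows. The sketch takes precomputed distances to the labeled cases, keeps the K closest ones, and returns the average of their labels weighted by case weight and inverse distance, normalized over the selected neighbors. The function name `soft_knn_score` and the small constant `eps`, added to avoid a division by zero when a labeled patient has exactly the same attributes as the new one, are assumptions of this example and are not taken from the paper.

```python
def soft_knn_score(distances, labels, case_weights, k, eps=1e-9):
    """Soft K-NN outcome s_p in [0, 1] for one new patient.

    distances[i], labels[i] and case_weights[i] describe labeled case i.
    eps is a hypothetical safeguard against zero distances.
    """
    if not (len(distances) == len(labels) == len(case_weights)):
        raise ValueError("inputs must have the same length")
    # Selection step: indices of the k labeled cases closest to the patient.
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Fusion step: inverse-distance, case-weighted average of the labels.
    numerator = 0.0
    denominator = 0.0
    for i in nearest:
        contribution = case_weights[i] / (distances[i] + eps)
        numerator += contribution * labels[i]
        denominator += contribution
    return numerator / denominator


# Hypothetical example: three labeled cases with equal case weights.
score = soft_knn_score([0.1, 0.2, 0.9], [1, 0, 1], [1 / 3] * 3, k=2)
```

A hard classifier would follow by comparing the returned score with a threshold, which the paper leaves out of scope.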

We conclude this subsection discussing, briefly, the set- 
tings of the K-NN model: i.e., the selection of an appropriate 
value K. Usually, it is not possible to define, a priori, the 
value of the parameter K. Thus, a setting phase is necessary 
to evaluate a satisfactory value with respect to a learning set. 



The setting phase consists of three steps. First, a specific
subset $\mathcal{C}_S$ of the learning set $\mathcal{C}_L$,
$\mathcal{C}_S \subset \mathcal{C}_L$, is defined. We refer to
this subset as the setting set in Section 5. Then an
evaluation metric that quantifies how well the K-NN algorithm
behaves on the setting set is computed for the integers
$(1, 2, \dots, K_{max})$ up to a specified limit $K_{max}$.
Finally, the smallest integer $K \in \{1, 2, \dots, K_{max}\}$
that maximizes the evaluation metric is kept and used on the
set $\mathcal{C}_U$ during the learning process. This
procedure is further discussed in Section 5.
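
The selection rule for K described above can be written as a short Python sketch. The function name `select_k` and the callable `metric_for_k`, which is assumed to return the evaluation metric obtained on the setting set for a given K, are hypothetical names introduced for this example.

```python
def select_k(metric_for_k, k_max):
    """Return the smallest K in {1, ..., k_max} maximizing the metric.

    metric_for_k is assumed to map an integer K to the evaluation
    metric measured on the setting set.
    """
    best_k = None
    best_value = float("-inf")
    for k in range(1, k_max + 1):
        value = metric_for_k(k)
        # Strict comparison: in case of a tie the smaller K is kept.
        if value > best_value:
            best_k, best_value = k, value
    return best_k


# Hypothetical metric values for K = 1..4 (not results from the study).
example_metric = {1: 0.70, 2: 0.82, 3: 0.82, 4: 0.80}
chosen_k = select_k(example_metric.get, k_max=4)
```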

4. LEARNING PROCESS BASED ON LOGISTIC 
REGRESSION 

This section deals with the learning phase. As a matter of
fact, in order to implement the K-NN based CBR, we need to
compute, on the one hand, the parameters
$\{\omega_a\}_{a \in \mathcal{A}}$ to evaluate the similarity
between patients, and on the other hand, the parameters
$\{\omega_p\}_{p \in \mathcal{P}_L}$ in order to evaluate the
importance, or contribution, of each patient in
$\mathcal{P}_L$. We consider the scenario where the set of
parameters is computed once relying on the labeled cases.
These parameters are then exploited to solve new cases.

4.1. Logistic Regression 

In a nutshell, Logistic Models (LM) are useful to predict the
presence or absence of an outcome or a characteristic based
upon the values of a set $\mathcal{A}$ of predictor variables.
The method fits a regression model for binary response data
relying on the maximum likelihood method [16]. More
specifically, in this paper we consider the following
definition:

Definition 1 (Logistic Regression) Let $\mathcal{A}$ denote a
set of explanatory variables, $\mathcal{P}_L$ a set of cases,
$V$ a binary matrix in $\{0, 1\}^{|\mathcal{A}| \times |\mathcal{P}_L|}$
such that $\{V\}_{a,p} = v_{a,p}$ with $a \in \mathcal{A}$ and
$p \in \mathcal{P}_L$, and finally, let $Y$ refer to a vector
of binary expert outcomes (e.g., registered or not registered).

LR assumes that there exists an underlying LM that can
explain the decision outcomes $Y$ as a logistic function of the
matrix $V$ and a vector of regression parameters
$\beta \in \mathbb{R}^{|\mathcal{A}|+1}$.

Then LR fits the data in $V$ to a logistic function such that
for any case $p \in \mathcal{P}$ characterized by a vector of
values of the set $\mathcal{A}$:

$$\hat{y}_p = \left(1 + e^{-\left(\hat{\beta}_0 + \sum_{a \in \mathcal{A}} \hat{\beta}_a v_{a,p}\right)}\right)^{-1}$$

where $\{\{\hat{\beta}_a\}_{a \in \mathcal{A}}, \hat{\beta}_0\}$
represent the maximum likelihood estimates of the regression
parameters and $\hat{y}_p \in [0, 1]$ is the estimated
prediction outcome for any analyzed case $p$.
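
For concreteness, the logistic function of Definition 1 can be evaluated as in the following Python sketch. The coefficients used in the example are arbitrary values chosen for illustration, not estimates from the study, and the function name `logistic_prediction` is hypothetical.

```python
import math


def logistic_prediction(beta_0, betas, v_p):
    """Predicted probability for one case with binary attribute vector v_p.

    beta_0 is the intercept and betas holds one coefficient per attribute.
    """
    if len(betas) != len(v_p):
        raise ValueError("one coefficient per attribute is required")
    linear_term = beta_0 + sum(b * v for b, v in zip(betas, v_p))
    return 1.0 / (1.0 + math.exp(-linear_term))


# Arbitrary illustrative coefficients: the two effects cancel out here.
probability = logistic_prediction(0.0, [1.0, -1.0], [1, 1])
```

A positive coefficient raises the predicted probability of registration when the attribute is verified, and a negative one lowers it.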

In Definition 1, the regression coefficients reflect the
relative influence of the predictor factors on the cases'
registration on the waiting list. Thus it is natural to take
them into account when computing the weights of the attributes
$\mathcal{A}$ and the patients $\mathcal{P}_L$ as described in
Section 3. This matter is further detailed in the next
subsection.



4.2. Weighting of Attributes and Patients

The significance of each factor, when the regression provides
maximum likelihood estimates, is based on Wald's test, defined
as follows:

Definition 2 (Wald Statistic and Weighting of Attributes)
Let $\{\hat{\beta}_a\}_{a \in \mathcal{A}}$ denote a vector of
maximum likelihood estimates and
$\{\hat{\sigma}_a\}_{a \in \mathcal{A}}$ their respective
maximum likelihood standard deviations. Then Wald's statistic
with respect to the attribute $a \in \mathcal{A}$ is defined as:

$$\mathrm{Wald}_a = \frac{\hat{\beta}_a^2}{\hat{\sigma}_a^2}$$

Finally, the vector of weights of attributes,
$\{\omega_a\}_{a \in \mathcal{A}}$, is defined such that:

$$\omega_a = \frac{\mathrm{Wald}_a}{\sum_{a' \in \mathcal{A}} \mathrm{Wald}_{a'}}$$

When dealing with the set of labeled cases $\mathcal{C}_L$, LR
introduces a gap between the stored binary outcomes $Y$ and the
predicted soft outcomes $\hat{Y}$. For every
$p \in \mathcal{P}_L$, the value of the gap equals
$(y_p - \hat{y}_p)$. Relying on the definition of Pearson
residuals, we introduce the cases' weights
$\{\omega_p\}_{p \in \mathcal{P}_L}$ as follows:

Definition 3 (Weighting Cases) Let $p \in \mathcal{P}_L$ denote
a labeled case, $y_p$ its label and $\hat{y}_p$ the logistic
regression outcome. Pearson residuals are defined as:

$$e_p = \frac{y_p - \hat{y}_p}{\sqrt{\hat{y}_p (1 - \hat{y}_p)}}$$

where $e_p$ is assumed to be drawn from a standard normal
distribution. Thus $\omega_p$ is defined as:

$$\omega_p = \frac{P(\|e_p\|)}{\sum_{p' \in \mathcal{P}_L} P(\|e_{p'}\|)}$$

where $\| \cdot \|$ refers to the absolute value function and
$P(\cdot)$ refers to the probability density function of a
standard normal distribution.

We end this section by introducing a last notation for the sake
of clarity. Usually, several training phases are needed in order
to estimate all the parameters of a complete decision making
process. In such a case, the labeled set $\mathcal{P}_L$ needs
to be divided and distributed among the different phases. In
this paper, the parameters of both the LM and the K-NN algorithm
need to be learned. Thus the set $\mathcal{P}_L$ needs to be
subdivided into two sets: $\mathcal{P}_S$, introduced in the
previous section, for the sake of the K-NN algorithm, and
$\mathcal{P}_T$, referred to as the training set, dedicated to
the estimation of the LM parameters. Finally,
$\mathcal{P}_L = \mathcal{P}_T \cup \mathcal{P}_S$, and since
$\mathcal{P}_T$ and $\mathcal{P}_S$ must not overlap, i.e.,
they contain no common cases, their intersection is empty:
$\mathcal{P}_T \cap \mathcal{P}_S = \emptyset$.

The rest of the paper focuses on the implementation, evaluation
and interpretation of this methodology.

5. EXPERIMENTAL PROTOCOL AND RESULTS

5.1. Data description: Training, Setting and Evaluating
sets

The initial population included 1647 patients who began an 
ESRD treated by dialysis (652 (40%) women and 995 (60%) 
men). Among them, 350 i.e., 21%, have been registered on 
the waiting list of renal transplantation in the first year fol- 
lowing the start of RRT. 

Unfortunately, patients' data with respect to the selected
explanatory variables (cf. Subsection 2.2 for further details)
were not always complete or fully available. Since logistic
models cannot deal with missing data, we decided to restrict
this analysis to the subset of patients with no missing data.

Thus, the study population was reduced to 1137 patients
with complete data, which only represent 70% of the initial
population. It is worth mentioning that the general
characteristics of this population remain similar to those of
the original population. As a matter of fact, the population
still included a majority of men (692 men, 61%) and the rate
of patients registered on the waiting list remains similar to
that of the original population (255 patients, 23%). For the
rest of this section, we only focus on the 1137 patients with
complete data. We denote this set of patients $\mathcal{P}$,
as introduced in previous sections.

Thus, the set of patients $\mathcal{P}$ is such that
$|\mathcal{P}| = 1137$. For the sake of the experiment,
$\mathcal{P}$ is split into two sets: $\mathcal{P}_L$ and
$\mathcal{P}_U$. On the one hand, the set $\mathcal{P}_L$
represents the labeled set that we use for training the LM as
well as for setting the parameter K of the K-NN algorithm. On
the other hand, we kept a set $\mathcal{P}_U$, considered as the
unlabeled data on which we apply our methodology, for the
evaluation phase. The labeled set is also partitioned into two
sets: $\mathcal{P}_L = \mathcal{P}_T \cup \mathcal{P}_S$. The
training set $\mathcal{P}_T$ is dedicated to the LM, while the
setting set $\mathcal{P}_S$ is used to estimate an appropriate
value of K for the K-NN algorithm.

Finally, the training database, the setting database and
the evaluation database are built by random sampling from
the study population, such that^4:

$$|\mathcal{P}_T| = |\mathcal{P}_S| = |\mathcal{P}_U| = 379$$

A Chi-Square test was performed to verify that all three sets 
share common characteristics. The Chi-Square test showed 
no significant difference between the three databases. 
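
A random partition into three sets of equal size, as used for the training, setting and evaluation databases, could be sketched in Python as follows. The function name `split_three` and the fixed seed are assumptions made for reproducibility of the example; the comparison of the three sets with a Chi-Square test is not reproduced here.

```python
import random


def split_three(patient_ids, seed=0):
    """Randomly partition patient identifiers into three disjoint sets.

    The first two sets have len(ids) // 3 elements each; the last one
    receives the remaining identifiers.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    third = len(ids) // 3
    return ids[:third], ids[third:2 * third], ids[2 * third:]


# 1137 patients with complete data give three sets of 379 patients.
training, setting, evaluation = split_three(range(1137))
```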

5.2. Experimental Protocol 

The key aims of this subsection are twofold. On the one hand, 
we describe the algorithms considered in this experimental 
section and compare them to the overall approach detailed 

4 It is worth mentioning that no specific filtering was used to obtain the 
same number of patients in all three databases. It is a simple coincidence that 
occurred after discarding patients with incomplete data. 
Fig. 1: Experimental Protocol. During the learning phase, a training set is used to compute the parameters of a logistic regression
model. These parameters enable the computations of the weights of attributes as well as patients' weights. Then a setting set is 
used to evaluate an optimal K value for the K-NN algorithm. Finally all these estimates are exploited to evaluate five decision 
making algorithms referred to by the indexes (i) to (v). 



above. On the other hand, we present the evaluation criteria
considered in this paper to assess the quality of the
different simulated approaches.

As discussed in previous sections, we consider in this paper
the combination of a case based reasoning approach, viz. the
K-NN algorithm, with a logistic regression model. Moreover,
in order to enhance its behavior, we suggested several
weighting parameters that capture the relevance of the
explanatory variables and the labeled cases. In order to
evaluate the suggested approach, we propose to simulate five
different algorithms analyzed within two scenarios.

The five algorithms combine different elements described
in Sections 3 and 4. First we simulate, separately, the two
main algorithms described in previous sections:

• (i) The standalone logistic regression algorithm.

• (ii) The standalone K-NN algorithm (also referred to as the
standalone CBR algorithm in the rest of the paper).

Both algorithms have been extensively studied and are known
to be efficient prediction tools. In order to analyze the
benefit of weighting the attributes and/or the patients, we
start by simulating the standalone versions. Then we
progressively add the weighting variables introduced in
Subsections 3.2 and 4.2. This results in three other
approaches to consider. Thus, we can enumerate the following
algorithms:

• (iii) A K-NN with weighted attributes (also referred to
as CBR+$\omega_a$ in the simulation results).

• (iv) A K-NN with weighted patients (also referred to
as CBR+$\omega_p$ in the simulation results).

• (v) A K-NN with both weighted attributes and weighted
patients (also referred to as CBR+$\omega_a$+$\omega_p$ in the
simulation results). The latter is the approach suggested in
this paper. The four other algorithms are used for
comparison.

All five algorithms are computed within two scenarios. In
the first scenario, 19 explanatory variables, i.e., attributes,
that comply with the general medical model are used. This
scenario analyzes the performance of these algorithms when
the variables are already reliable from the empirical point
of view. In the second scenario, 50 additional randomly
defined attributes are considered in order to evaluate the
robustness of the simulated algorithms with respect to
uncertain models. Namely, the objective is to study the
behavior of the prediction tools when the knowledge database
contains factors not related to the prediction target.

Moreover, in every scenario we evaluate the benefit of
automated variable selection for LR before simulating the
algorithms. Thus, for every scenario, we describe two
sub-scenarios. We refer to them in simulations as the
sub-scenarios Prediction using all attributes and Prediction
using selected attributes^5. All scenarios and algorithms are
summarized in Figure 1.

All performance results are presented in terms of the area
under the receiver operating characteristic curve (AUC). In
order to compute confidence intervals of AUC results, a
bootstrap resampling procedure is performed [17]. Thus, the
probability distribution of the AUC statistic is simulated by
500 random samples from the original evaluation database.
Then a specific non-parametric Monte Carlo AUC estimator,
$\widehat{AUC}$, is computed. The chosen estimator is an
unbiased AUC estimator such that:

$$\widehat{AUC} = \frac{1}{k} \sum_{b=1}^{k} AUC_b$$

where the index $b$ refers to the bootstrap iteration and $k$
is the total number of iterations ($k = 500$ in this case).

We computed the performance evaluation estimates such
that the confidence interval limits are the 2.5 and 97.5
percentiles of the AUC distribution.
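
The bootstrap evaluation described above can be sketched in Python as follows. The AUC is computed here with the rank-based (Mann-Whitney) formulation, the percentiles use a simple nearest-rank rule, and resamples containing a single class are redrawn; these three choices, like the function names `auc` and `bootstrap_auc`, are assumptions of the example and are not details given in the paper.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve: probability that a positive case is
    scored above a negative one (ties count for one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in positives for sn in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc(scores, labels, k=500, seed=0):
    """Mean bootstrap AUC with 2.5 and 97.5 percentile limits."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    while len(values) < k:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        # Assumption: redraw resamples that contain a single class.
        if 0 < sum(resampled_labels) < n:
            values.append(auc([scores[i] for i in idx], resampled_labels))
    values.sort()
    mean = sum(values) / k
    lower = values[int(0.025 * k)]
    upper = values[int(0.975 * k) - 1]
    return mean, lower, upper
```

For instance, `bootstrap_auc(scores, labels)` applied to the soft K-NN scores and the true labels of the evaluation set would return the estimate together with its confidence limits.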

5.3. Computational Tools 

All computations involved in this study, including LM based
regression and CBR algorithms, were performed in the free
software environment 'R'^6.

More specifically, we relied on the package 'stats' (version
2.12.2) to implement logistic regression. As a matter of
fact, it allows fitting generalized linear models thanks to
the 'glm' function. Then, the functions 'Anova' and 'summary'
enabled the estimation of our LM parameters. Finally,
the function 'step' was used for selecting LR variables relying
on a stepwise procedure and on Akaike's criterion.
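
Once the coefficients, their standard deviations and the fitted probabilities are available from the regression, the weights of Definitions 2 and 3 follow from a few arithmetic steps. The Python sketch below (the study itself used R) assumes the squared form of the Wald statistic, i.e. the squared ratio of a coefficient to its standard deviation, and takes the regression outputs as plain lists; the function names are hypothetical.

```python
import math


def attribute_weights(betas, std_errors):
    """Normalized Wald statistics (assumed squared form), summing to one."""
    wald = [(b / s) ** 2 for b, s in zip(betas, std_errors)]
    total = sum(wald)
    return [w / total for w in wald]


def standard_normal_pdf(x):
    """Probability density function of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def case_weights(labels, fitted):
    """Case weights from Pearson residuals, summing to one.

    fitted holds the predicted probabilities, strictly between 0 and 1.
    """
    residuals = [(y - p) / math.sqrt(p * (1.0 - p))
                 for y, p in zip(labels, fitted)]
    densities = [standard_normal_pdf(abs(e)) for e in residuals]
    total = sum(densities)
    return [d / total for d in densities]


# Arbitrary illustrative inputs, not estimates from the study.
w_attributes = attribute_weights([2.0, 1.0], [1.0, 1.0])
w_cases = case_weights([1, 1], [0.9, 0.5])
```

Cases whose label is well explained by the regression have a small residual and therefore receive a larger weight.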

Related to CBR algorithms, we designed our specific func- 
tions using the programming language of the R user interface 
to ensure calculation of similarity measures, selection of near- 
est neighbors, prediction of probability to be registered, and 
classification of cases. 

5.4. Results 

Table 1 shows the weights of attributes calculated from the
Wald statistics using the regression coefficient estimates of
the LM and their respective standard deviations, as defined in
Subsection 4.2. Both sub-scenarios, summarized in Figure 1,
are considered, where estimations are conducted after (or
without) a stepwise selection procedure on the set of
explanatory variables (viz., attributes). The results of
Table 1 first consider the case database with only the 19
attributes relevant to our problem (referred to as before
adding of 50 random factors). Then, 50 random attributes are
added and the computations of both sub-scenarios are repeated.

5 The selection procedure can be referred to as stepwise selection. 
6 Version 2.12.2 GUI 1.36 Leopard build 32-bit for Mac OS X [18].







                                                          Before adding of 50     After adding of 50
                                                          random factors          random factors
                                                          Before      After       Before      After
                                                          attribute   attribute   attribute   attribute
Attribute                                                 selection   selection   selection   selection

Social and demographic factors
  Sex                                                     0,0%        -           0,2%        -
  Age*                                                    65,4%       68,8%       12,2%       23,9%
  Current occupation*                                     2,5%        2,7%        1,3%        1,3%

Clinical and biological factors
  Diabetes (type 1 or 2)                                  1,0%        -           2,7%        2,5%
  Hypertension                                            5,2%        5,1%        5,2%        4,8%
  Chronic respiratory failure                             0,4%        -           2,4%        1,9%
  Chronic heart failure                                   2,0%        -           1,3%        2,2%
  Ischemic heart disease                                  5,7%        7,3%        2,0%        1,3%
  Heart conduction disorder (or arrhythmia)               0,2%        -           0,8%        1,2%
  Past history of malignancy                              6,1%        4,5%        3,1%        4,3%
  Positive serology (HCV, HBV, HIV)†                      1,3%        -           1,4%        -
  Liver cirrhosis                                         0,9%        -           1,0%        1,9%
  Disability                                              2,7%        3,0%        1,5%        1,5%
  Hemoglobin (< or > 11 g/dl)                             0,0%        -           0,0%        -

Factors related to medical care
  Ownership of nephrology facilities (private or public)  3,4%        5,9%        0,1%        -
  Institution performing transplantation                  3,1%        2,8%        0,1%        -
  Hemodialysis or peritoneal dialysis*                    0,0%        -           1,4%        2,6%
  Urgent or planned dialysis session*                     0,0%        -           0,1%        -
  Urgent or planned first catheterization                 0,0%        -           0,2%        1,8%

Random factors                                            0,0%        0,0%        63,1%       48,9%

* at the first renal replacement therapy; † HCV: Hepatitis C Virus, HBV: Hepatitis B Virus, HIV: Human Immunodeficiency Virus.

Table 1: List of attributes and weights used by K-Nearest Neighbors algorithms before and after adding the 50 random
attributes, and before and after the stepwise selection procedure of attributes.



As expected, the attributes do not all have the same impact
on registration. Their respective impact is reflected in the
performance of the K-NN algorithm through the values of the
attribute weights.

When only the 19 relevant factors are considered, without
a stepwise selection procedure, the most relevant predictive
factors seem to be: age, hypertension, ischemic heart
disease, past history of malignancy, ownership of nephrology
facilities and follow-up in an institution performing renal
transplantation. It is worth noting that age and past history
of malignancy are the only factors with a significant Wald
test value. After the stepwise selection procedure, LM kept
eight predictive factors (the six above plus current
occupation and disability, see Table 1), among which age,
hypertension, ischemic heart disease and ownership of
nephrology facilities showed a significant Wald test value.
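
The Wald test mentioned above compares an estimated regression coefficient to its standard error. As a minimal sketch (the function name and the numerical values below are hypothetical and are not taken from the fitted model of this study), the statistic and its two-sided p-value can be computed as follows:

```python
import math


def wald_test(coef, std_err):
    """Return the Wald z statistic and its two-sided p-value for one coefficient."""
    z = coef / std_err
    # Two-sided p-value under the standard normal law:
    # P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical coefficient and standard error, for illustration only.
z, p = wald_test(coef=-0.08, std_err=0.02)
print(f"z = {z:.2f}, p = {p:.2e}, significant at the 5% level: {p < 0.05}")
```

A factor is usually reported as significant when this p-value falls below a chosen threshold such as 5%; the threshold used here is an assumption of the sketch.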

We can notice that the logistic regression performed in 
this study showed results equivalent to those described in re- 
cent literature 1014]. We used both medical and non-medical 
predictive factors of transplant registration. As mentioned in
Subsection 2.1, non-medical factors might not be relevant for
clinical practice; however, our main objective is to discuss
the efficiency of a new computational K-NN, not to address
concrete decision-making applications.

In this kind of application field, age is unsurprisingly
one of the most relevant clinical factors. As could be
expected, it showed a very high weight compared to the other
factors. This fact might limit the results of the study.
Nevertheless, since we need to design a decision-making
process that performs automatically, we decided to keep age
among the discriminating factors in the LM and K-NN
algorithms.

After adding 50 random factors, the estimates from the LM
and the weights of the attributes changed substantially. For
example, the weight of age at the first RRT decreased from
65% and 69%, respectively before and after stepwise attribute
selection, to 12% and 24% in the protocol arm including the
random factors. Overall, the role of both the socio-demographic
factors and the factors related to medical care decreased
after the introduction of random factors, while the role of
clinical and biological factors remained stable. The weights
lost by the socio-demographic factors and the factors related
to medical care were transferred to the random factors, which
kept a substantial weight in the prediction despite the
stepwise selection of the attributes. As expected, adding
random factors creates an artifact in the definition of the
relevant factors and in the course of the prediction procedure.
This artifact helps us assess the robustness of LM combined
with K-NN algorithms, which is discussed in the rest of this
Section.
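
The shift of weight between groups of factors can be checked directly from Table 1. The short sketch below sums the weights of the "before attribute selection" columns by group; the values are transcribed from Table 1, while the variable and function names are ours and purely illustrative.

```python
# Weights (in %) transcribed from Table 1, "before attribute selection" columns.
# Each tuple is (before adding random factors, after adding random factors).
weights = {
    "Social and demographic": [(0.0, 0.2), (65.4, 12.2), (2.5, 1.3)],
    "Clinical and biological": [
        (1.0, 2.7), (5.2, 5.2), (0.4, 2.4), (2.0, 1.3), (5.7, 2.0), (0.2, 0.8),
        (6.1, 3.1), (1.3, 1.4), (0.9, 1.0), (2.7, 1.5), (0.0, 0.0),
    ],
    "Medical care": [(3.4, 0.1), (3.1, 0.1), (0.0, 1.4), (0.0, 0.1), (0.0, 0.2)],
    "Random": [(0.0, 63.1)],
}


def group_totals(weights):
    """Sum the weights of each group of factors for both scenarios."""
    return {
        group: (
            round(sum(before for before, _ in rows), 1),
            round(sum(after for _, after in rows), 1),
        )
        for group, rows in weights.items()
    }


totals = group_totals(weights)
for group, (before, after) in totals.items():
    print(f"{group:<24} {before:5.1f}% -> {after:5.1f}%")
```

With these values, the socio-demographic group goes from 67.9% to 13.7% and the medical care group from 6.5% to 1.9%, whereas the clinical and biological group only moves from 25.5% to 21.4%.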

Figures 2 and 3 show the prediction results obtained by the
LM and by the CBR methods (the standalone K-NN, the K-NN with
weighting of either attributes or patients, and the K-NN with
weighting of both patients and attributes), respectively
before and after adding 50 random attributes (as summarized in
Figure 1).

First of all, we evaluate the performance of the algorithms 
in the ideal case with no artifact, i.e., only the 19 relevant at- 
tributes are considered. In this context, results show that pre- 
dictions provided by LM and standalone CBR methods tend 
to be more powerful than methods combining K-NN and LM. 
This is not a surprise as both LM and K-NN are known to be 
quite efficient when the attributes are relevant. 

The right sub-figure of Figure 2 shows the performance of
the tested algorithms in the ideal case with no artifact, when
a pre-selection of the attributes is conducted before running
the algorithms. Their performance does not change significantly,
except for the algorithm referred to as CBR+ω_a (viz., K-NN
with weighted attributes), which suffers a significant
performance decrease. Since the stepwise selection of the
attributes is conducted before launching the algorithm, i.e.,
before weighting the attributes and computing the K-NN
algorithm, we can conclude that the stepwise attribute
selection might discard some attributes that would have had a
significant impact once weighted.

Fig. 2: Prediction results before adding 50 random attributes. (a) Prediction using all attributes; (b) prediction using selected attributes.

Fig. 3: Prediction results after adding 50 random attributes. (a) Prediction using all attributes; (b) prediction using selected attributes.

Then, a similar evaluation is performed after adding 50
random attributes, which are not expected to be relevant.
In such a scenario, the standalone LM and K-NN could run into
difficulties, as the context is not optimally chosen to tune
their performance. This is indeed observed in Figure 3, where
the performance of the standalone LM and K-NN degrades
significantly.

One of the most interesting results throughout Figures 2
and 3 is the robustness of the combination of LM and CBR
when both attributes and patients are weighted. In all
scenarios, with or without artifact, with or without stepwise
attribute selection, the algorithm referred to as
CBR+ω_a+ω_p performs consistently. It provides a prediction
rate of around 88% in all scenarios, whereas all the other
algorithms tested in this paper seem to suffer at one point
or another. This robustness offers a performance guarantee.
The algorithm might prove less efficient than others in some
specific scenarios; however, since in realistic scenarios it is
usually impossible to tell a priori whether there is an artifact
or not, choosing the algorithm that combines both weighted
attributes and weighted cases seems to be a cautious choice.
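
To illustrate why weighting both attributes and cases can absorb irrelevant attributes, the sketch below implements a simple soft K-NN. It is only an illustration under stated assumptions: the weighted Euclidean distance, the case-weighted vote, the function name and the toy data are ours and do not reproduce the similarity measure or the weights used in this study.

```python
import math


def weighted_knn_predict(query, cases, labels, attr_weights, case_weights, k=3):
    """Estimate P(label = 1) for `query` from its k nearest solved cases.

    ASSUMPTIONS (illustrative only): weighted Euclidean distance and a
    case-weighted vote among the k nearest neighbors.
    """
    def distance(x, y):
        return math.sqrt(
            sum(w * (a - b) ** 2 for w, a, b in zip(attr_weights, x, y))
        )

    # Indices of the k solved cases closest to the query.
    nearest = sorted(range(len(cases)), key=lambda i: distance(query, cases[i]))[:k]
    total = sum(case_weights[i] for i in nearest)
    if total == 0:
        return 0.5  # no informative neighbor: fall back to an uninformative score
    return sum(case_weights[i] * labels[i] for i in nearest) / total


# Toy data: attribute 0 is informative, attribute 1 is pure noise.
cases = [(0.1, 0.9), (0.2, 0.1), (0.8, 0.8), (0.9, 0.2)]
labels = [0, 0, 1, 1]
query = (0.85, 0.1)
uniform_cases = [1.0, 1.0, 1.0, 1.0]

print(weighted_knn_predict(query, cases, labels, (0.5, 0.5), uniform_cases, k=2))
print(weighted_knn_predict(query, cases, labels, (1.0, 0.0), uniform_cases, k=2))
print(weighted_knn_predict(query, cases, labels, (0.5, 0.5), [1.0, 0.2, 1.0, 1.0], k=2))
```

On this toy example, equal attribute weights let the noisy attribute pull a case of the wrong class into the neighborhood (score 0.5). Setting the weight of the noisy attribute to zero restores a score of 1.0, and down-weighting the misleading case raises the score to about 0.83 even with equal attribute weights.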

6. RELATED WORKS AND PERSPECTIVES 

Logistic regression analyses are widely used in medical
research; however, they are more commonly reserved for
determining prognostic factors than for predicting disease.
To our knowledge, no study evaluates the prediction of access
to the French renal transplant waiting list by LM.

Bayat et al. investigated this issue in two recent publications
using a Bayesian network and a decision tree method IflOl . 
They do not present any AUCs, so it is not possible to
directly compare their results with ours. However, they
conclude that both methods have very high predictive
performances and that age is the most important factor for
predicting access to the waiting list, which is consistent
with our results.

Similarly, Chuang compared several classifiers, including
LM and CBR methods, to predict the presence of liver disease
[19]. For the author, the results related to CBR methods
testify to the solid diagnosis capacity of CBR in examining
healthy data. Our results support this conclusion, since we
have shown that CBR methods present predictive performances
equivalent to those obtained by LM. This paper shows, however,
that this is true only if the considered attributes are well
chosen and reliable with regard to the problem to solve.



Nugent et al. presented the first association between CBR
and LM in 2009 with a methodology called KLEF, for
Knowledge-Light Explanation Framework [20]. The method
describes how to gain high-level knowledge by a top-down
mechanism using logistic regression. LM is used a posteriori
to define one nearest neighbor from the cases retrieved by a
K-NN algorithm.

LM in the present study was used differently. The logistic
model was fitted directly on the overall knowledge database,
and information from LM directly contributed to the
computation of similarity measures and classification
probabilities. This latter approach is described by Stahl
et al. as a bottom-up mechanism [21].

To the best of our knowledge, only two publications
describe methods similar to our hybrid approach. The first one
is applied to breast cancer diagnosis (Huang et al. [22]) and
the second one is applied to the diagnosis of liver disease
(Chuang [19]).

In Chuang's paper, the CBR methodology is different from
the one applied in the present study: similarity measures are
performed separately for cases with and without liver disease.
In Huang's paper, similarity computation is performed through
a K-NN algorithm, as in the present work. However, LM is only
used to define the most relevant factors and to compute
attribute weights.

In the present study, LM is also used to perform attribute
selection and attribute weighting. In addition, however, we
proposed to introduce Pearson residuals to weight the cases
in the design of our K-NN algorithm. In our opinion, case
weighting based on Pearson residuals contributes, together
with attribute weighting, to the description and specification
of the cases when defining problem-specific knowledge [6].
Thus, LM defines an archetype of registered and non-registered
patients in the knowledge database, and the LM residuals
reflect how well each patient fits the archetype. Relying
only on logistic regression coefficients or stepwise selection
to define the cases as well as the problem utility would amount
to assuming that all patients match the LM archetype perfectly.
We know for a fact that this is not true. Hence, computing
specific weights for each case, relying on LM residuals,
appears as an attempt to correct that approximation. To the
best of our knowledge, this is the first time that such an
approach is discussed in the literature.
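
For a binary outcome y and a fitted probability p, the Pearson residual is (y - p) / sqrt(p (1 - p)). The sketch below computes it and then turns it into a case weight. The residual formula is standard; the mapping from residual to weight, 1 / (1 + |r|), is only a hypothetical choice made for this illustration and is not the weighting defined in this paper.

```python
import math


def pearson_residual(y, p):
    """Pearson residual of a binary outcome y (0 or 1) given fitted probability p."""
    return (y - p) / math.sqrt(p * (1.0 - p))


def case_weight(y, p):
    """HYPOTHETICAL mapping from residual to case weight: 1 / (1 + |r|).

    Cases that fit the logistic model well (small |r|) get a weight close
    to 1; cases far from the model's archetype get a smaller weight.
    """
    return 1.0 / (1.0 + abs(pearson_residual(y, p)))


# A registered patient (y = 1) that the model predicts well, then poorly.
for y, p in [(1, 0.9), (1, 0.1), (0, 0.5)]:
    r = pearson_residual(y, p)
    print(f"y={y}, p={p}: residual={r:+.2f}, weight={case_weight(y, p):.2f}")
```

In this sketch, a registered patient with a fitted probability of 0.9 keeps a weight of 0.75, while a registered patient with a fitted probability of 0.1 is reduced to 0.25, which is one way of encoding the adequacy of each case with regard to the archetype.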

As for Chuang's paper, the author points out classification
improvements relying on a hybrid CBR approach compared to
a standalone CBR. Huang's publication also compares several
kinds of hybrid approaches: a neural network with or without
fuzzy logic and two hybrid CBR systems, one combining
CBR with a decision tree and one combining CBR with LM.
The neural networks show superior performance, but the
authors emphasize the speed of case retrieval and the more
easily interpretable results of the CBR methodology.

In the present study, the hybrid CBR approaches did not
show significant improvements for patient classification
compared to the standalone CBR approach. However, the hybrid
CBR system combining both attribute weighting and case
weighting seems to be very robust to artifacts in the database,
which might occur in all realistic scenarios. From our point
of view, this interesting observation provides new perspectives
for future CBR systems, particularly for integrating CBR
systems into large and unspecific knowledge databases such as
electronic health records.
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